---
title: "IODS course project"
author: Hanna-Leena
output:
  html_document:
    theme: cosmo
    toc: true
    toc_depth: 2
    fig_caption: true
    fig_width: 6
    fig_height: 4
---
This is a difficult course for me; I have never done anything like this before, and I hope to learn the basics of the R programming language. I heard about the course from my colleagues. My GitHub repository for the course is here: https://github.com/Hanna-Leena/IODS-project.
students2014 <- read.csv("learning2014.csv", header = TRUE, sep = ",")
# Looking at the structure
str(students2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
head(students2014)
## gender Age Attitude deep stra surf Points
## 1 F 53 37 3.583333 3.375 2.583333 25
## 2 M 55 31 2.916667 2.750 3.166667 12
## 3 F 49 25 3.500000 3.625 2.250000 24
## 4 M 53 35 3.500000 3.125 2.250000 10
## 5 M 49 37 3.666667 3.625 2.833333 22
## 6 F 38 38 4.750000 3.625 2.416667 21
The data has 166 observations, one per student, and 7 variables: gender, age, attitude, deep, stra, surf and points. The attitude variable measures the students’ global attitude toward statistics, Points are the exam points, and deep, stra and surf measure three aspects of learning (deep, strategic and surface learning).
pairs(students2014[-1], col=students2014$gender)
summary(students2014)
## gender Age Attitude deep stra
## F:110 Min. :17.00 Min. :14.00 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :32.00 Median :3.667 Median :3.188
## Mean :25.51 Mean :31.43 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :50.00 Max. :4.917 Max. :5.000
## surf Points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
This pairs picture is not actually helping me to understand what kind of data I am processing. I like the summary table, and I actually use it quite often when getting familiar with my data. Let’s look at it more closely.
There are 110 females and 56 males. The minimum age is 17, median 22, mean 25.51 and maximum 55; the quartiles are also available (1st quartile 21, 3rd quartile 27). Exam points: minimum 7, median 23, mean 22.72, maximum 33, 1st quartile 19, 3rd quartile 27.75. Global attitude: minimum 14, median 32, mean 31.43, maximum 50, 1st quartile 26, 3rd quartile 37. Deep learning scores: minimum 1.58, median 3.667, mean 3.68, 3rd quartile 4.083, and so on. Not everybody likes these kinds of tables, but for me they give a quick insight into the data, and I use them quite often.
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
p <- ggpairs(students2014, mapping = aes(alpha=0.5), lower = list(combo = wrap("facethist", bins = 20)))
p
This way the data is more easily readable. Almost all the variables are approximately normally distributed; age is skewed. The picture also shows that there is no very strong correlation between the variables, since the correlations lie between about -0.3 and 0.4.
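To back up the visual impression with numbers, the pairwise correlations can also be printed directly. A small sketch, assuming students2014 is loaded as above (the first column, gender, is a factor and is dropped):

```r
# Correlation matrix of the numeric variables
cors <- cor(students2014[-1])
round(cors, 2)

# Correlations of the other variables with exam points,
# ordered by absolute size
pts <- cors["Points", colnames(cors) != "Points"]
pts[order(abs(pts), decreasing = TRUE)]
```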
model <- lm(Points ~ gender + Attitude + stra, data = students2014)
summary(model)
##
## Call:
## lm(formula = Points ~ gender + Attitude + stra, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.7179 -3.3285 0.5343 3.7412 10.9007
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.97982 2.40303 3.737 0.000258 ***
## genderM -0.22362 0.92477 -0.242 0.809231
## Attitude 0.35100 0.05956 5.893 2.13e-08 ***
## stra 0.89107 0.54409 1.638 0.103419
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.304 on 162 degrees of freedom
## Multiple R-squared: 0.2051, Adjusted R-squared: 0.1904
## F-statistic: 13.93 on 3 and 162 DF, p-value: 3.982e-08
This model studied the association of exam points (the target variable) with gender, attitude and strategic learning (the explanatory variables). Attitude is the only explanatory variable that is statistically significant (p-value < 0.05); in fact its p-value is very low.
Next I fit a regression model with only the one significant explanatory variable, Attitude.
model_2 <- lm(Points ~ Attitude, data = students2014)
summary(model_2)
##
## Call:
## lm(formula = Points ~ Attitude, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## Attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
These results mean that the estimate for attitude is 0.35 and the p-value stays below 0.05, so it remains statistically significant. In other words, when attitude increases by 1 unit, the predicted exam points increase by 0.35.
R-squared is a statistic showing how close the data are to the fitted regression line, i.e. how well the model fits my data. In general, the higher the R-squared, the better the model fits the data. R-squared is defined as the percentage of the response variable’s variation that is explained by the linear model: R-squared = explained variation / total variation, so it is always between 0 and 100%. In this summary the multiple R-squared is 0.1906, so about 19%, which means that this model explains only about a fifth of the variation in exam points around their mean. R-squared does not tell whether the coefficient estimates and predictions are biased, which is why the residual plots must also be assessed.
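The definition above can be checked by hand from the fitted model; a sketch using model_2 and the equivalent formula R-squared = 1 - (unexplained variation / total variation):

```r
# Multiple R-squared computed from its definition
ss_res <- sum(residuals(model_2)^2)                                 # unexplained variation
ss_tot <- sum((students2014$Points - mean(students2014$Points))^2)  # total variation
1 - ss_res / ss_tot   # should agree with the 0.1906 reported by summary(model_2)
```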
The linear regression model makes several assumptions. With the help of the following plots you can analyze the residuals of the model and see how well the model is working, or whether there are problems with it.
plot(model_2, which = c(2,1,5))
This plot shows that the residuals have roughly constant variance: they are equally spread around the horizontal line without distinct patterns.
With the Q-Q plot you can check whether the residuals are normally distributed. As you can see, the points are very close to the line; the upper and lower tails show some deviation, but I think this is acceptable. I would interpret the errors as approximately normally distributed.
This plot helps you to see whether there are outliers in the data that are influential in the linear regression model. In this analysis all the cases are inside the Cook’s distance lines.
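The graphical conclusion about influential points can also be checked numerically. A sketch, using the common (but not universal) rule of thumb that flags observations with Cook’s distance above 4/n:

```r
cd <- cooks.distance(model_2)
max(cd)                           # largest Cook's distance in the data
sum(cd > 4 / nrow(students2014))  # observations flagged by the 4/n rule of thumb
```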
This week we will learn about logistic regression. Before analysing the data I did some data wrangling; it wasn’t as difficult as the week before. Now the data is ready to be analyzed. The following changes have been made: the variables not used for joining the two datasets have been combined by averaging. The variable ‘alc_use’ is the average of ‘Dalc’ and ‘Walc’, and ‘high_use’ is TRUE if ‘alc_use’ is higher than 2 and FALSE otherwise.
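The derivation of the two new variables described above boils down to two lines. This is only an illustrative sketch: the CSV read in the next chunk already contains these columns.

```r
# alc_use: average of workday (Dalc) and weekend (Walc) alcohol use
alc$alc_use  <- (alc$Dalc + alc$Walc) / 2
# high_use: TRUE when the average use exceeds 2
alc$high_use <- alc$alc_use > 2
```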
alc <- read.csv("alc2014.csv", header = TRUE, sep = ",")
#Looking at the structure
dim(alc)
## [1] 382 35
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
summary(alc)
## school sex age address famsize Pstatus
## GP:342 F:198 Min. :15.00 R: 81 GT3:278 A: 38
## MS: 40 M:184 1st Qu.:16.00 U:301 LE3:104 T:344
## Median :17.00
## Mean :16.59
## 3rd Qu.:17.00
## Max. :22.00
## Medu Fedu Mjob Fjob
## Min. :0.000 Min. :0.000 at_home : 53 at_home : 16
## 1st Qu.:2.000 1st Qu.:2.000 health : 33 health : 17
## Median :3.000 Median :3.000 other :138 other :211
## Mean :2.806 Mean :2.565 services: 96 services:107
## 3rd Qu.:4.000 3rd Qu.:4.000 teacher : 62 teacher : 31
## Max. :4.000 Max. :4.000
## reason nursery internet guardian traveltime
## course :140 no : 72 no : 58 father: 91 Min. :1.000
## home :110 yes:310 yes:324 mother:275 1st Qu.:1.000
## other : 34 other : 16 Median :1.000
## reputation: 98 Mean :1.448
## 3rd Qu.:2.000
## Max. :4.000
## studytime failures schoolsup famsup paid activities
## Min. :1.000 Min. :0.0000 no :331 no :144 no :205 no :181
## 1st Qu.:1.000 1st Qu.:0.0000 yes: 51 yes:238 yes:177 yes:201
## Median :2.000 Median :0.0000
## Mean :2.037 Mean :0.2016
## 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :4.000 Max. :3.0000
## higher romantic famrel freetime goout
## no : 18 no :261 Min. :1.000 Min. :1.00 Min. :1.000
## yes:364 yes:121 1st Qu.:4.000 1st Qu.:3.00 1st Qu.:2.000
## Median :4.000 Median :3.00 Median :3.000
## Mean :3.937 Mean :3.22 Mean :3.113
## 3rd Qu.:5.000 3rd Qu.:4.00 3rd Qu.:4.000
## Max. :5.000 Max. :5.00 Max. :5.000
## Dalc Walc health absences
## Min. :1.000 Min. :1.000 Min. :1.000 Min. : 0.0
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:3.000 1st Qu.: 1.0
## Median :1.000 Median :2.000 Median :4.000 Median : 3.0
## Mean :1.482 Mean :2.296 Mean :3.573 Mean : 4.5
## 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.: 6.0
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :45.0
## G1 G2 G3 alc_use
## Min. : 2.00 Min. : 4.00 Min. : 0.00 Min. :1.000
## 1st Qu.:10.00 1st Qu.:10.00 1st Qu.:10.00 1st Qu.:1.000
## Median :12.00 Median :12.00 Median :12.00 Median :1.500
## Mean :11.49 Mean :11.47 Mean :11.46 Mean :1.889
## 3rd Qu.:14.00 3rd Qu.:14.00 3rd Qu.:14.00 3rd Qu.:2.500
## Max. :18.00 Max. :18.00 Max. :18.00 Max. :5.000
## high_use
## Mode :logical
## FALSE:268
## TRUE :114
##
##
##
The modified data consists of 382 observations of 35 variables, as seen above. The data describes student achievement in secondary education at two Portuguese schools, collected through questionnaires and school reports. Originally there were two datasets covering performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). More information about the original dataset is available here: https://archive.ics.uci.edu/ml/datasets/Student+Performance.
For this analysis I chose 4 variables from the data, and for each of them I present my own hypothesis about its relationship with alcohol consumption. The chosen variables are failures, absences, famrel and higher.

1. “failures” gives the number of past class failures, which might be linked to higher alcohol consumption. Common sense suggests these two variables could be correlated, which is why I chose it.
2. “absences” measures school absences. Again, common sense suggests that those who drink more alcohol have more school absences, so I want to see whether there is a correlation.
3. “famrel” measures the quality of family relationships. It is commonly thought that weaker social ties are connected to higher alcohol consumption; now we will find out whether these two variables are correlated in this data.
4. “higher” tells whether the student wants to pursue higher education. It can be argued that those who study more and harder have less time for partying and, for instance, alcohol use, so we will find out whether those who want a higher education consume less alcohol.
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
table_1 <- table(high_use = alc$high_use, number_of_past_class_failures = alc$failures)
addmargins(table_1)
## number_of_past_class_failures
## high_use 0 1 2 3 Sum
## FALSE 244 12 10 2 268
## TRUE 90 12 9 3 114
## Sum 334 24 19 5 382
failurel <- ggplot(alc, aes(x = high_use, y = failures))
failurel + geom_count()
From this table and plot you can see that those who have no past class failures mostly consume less alcohol, while those who consume more alcohol tend to have more class failures.
g <- ggplot(alc, aes(x = high_use, y = absences))
g2 <- g + geom_boxplot() + ggtitle("High alcohol consumption and absences")
g2
From this picture we can see that those who consume less alcohol have fewer school absences, while those who consume more alcohol have more absences. So there might be a correlation between these two variables.
g3 <- ggplot(alc, aes(x = high_use, y = famrel))
g4 <- g3 + geom_boxplot()
g4
The students who consume less alcohol give higher scores for their family relationships, while those who consume more alcohol give lower scores. From this you cannot say which causes which: do the students start using alcohol because of bad family relationships, or do family relationships get worse when the student starts using more alcohol?
table_2 <- table(high_use = alc$high_use, wants_high_education = alc$higher)
table_2
## wants_high_education
## high_use no yes
## FALSE 9 259
## TRUE 9 105
round(prop.table(table_2) * 100, 1)
## wants_high_education
## high_use no yes
## FALSE 2.4 67.8
## TRUE 2.4 27.5
From these two tables it looks like high alcohol consumption is relatively more common among those who do not aim for higher education.
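The overall percentages above mix two very different group sizes (18 versus 364 students), so conditioning on the education plan gives a fairer comparison; a sketch using the same table_2:

```r
# Column-wise proportions: distribution of high_use within each 'higher' group
round(prop.table(table_2, margin = 2) * 100, 1)
```

With the counts above, half (9/18) of the students with no higher-education plans are high users, compared with about 29% (105/364) of those with such plans.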
m <- glm(high_use ~ failures + absences + famrel + higher, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ failures + absences + famrel + higher,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3593 -0.7890 -0.6902 1.1782 1.8993
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.12562 0.74528 -0.169 0.866143
## failures 0.43484 0.19618 2.217 0.026653 *
## absences 0.08341 0.02272 3.671 0.000241 ***
## famrel -0.22708 0.12502 -1.816 0.069322 .
## higheryes -0.36274 0.55177 -0.657 0.510924
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 436.43 on 377 degrees of freedom
## AIC: 446.43
##
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept) failures absences famrel higheryes
## -0.12562333 0.43483926 0.08340567 -0.22707832 -0.36273561
As seen above, the p-values for failures and absences are low (< 0.05), and for famrel the p-value is close to 0.05. The coefficients of failures and absences are positive (failures 0.43, absences 0.08), meaning that more failures and more absences predict higher alcohol use. Family relationships have a negative coefficient (-0.23), which suggests that better family relationships have a protective effect against high alcohol use. Plans for higher education also have a negative coefficient (-0.36), but this effect is not statistically significant.
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ tibble 2.1.3 ✔ purrr 0.3.3
## ✔ readr 1.3.1 ✔ stringr 1.4.0
## ✔ tibble 2.1.3 ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(dplyr)
OR <- coef(m) %>% exp()
CI <- exp(confint(m))
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.8819470 0.1995909 3.776889
## failures 1.5447147 1.0499664 2.282173
## absences 1.0869827 1.0418018 1.139025
## famrel 0.7968584 0.6230747 1.019051
## higheryes 0.6957704 0.2372414 2.121797
This table shows the coefficients of the model as odds ratios with their confidence intervals. The odds ratio for failures is 1.54 (confidence interval 1.05-2.28) and for absences 1.09 (1.04-1.14); this means that each additional failure multiplies the odds of high alcohol consumption by 1.54, and each additional absence by 1.09. The odds ratio for famrel is 0.80 (0.62-1.02), which would mean that better family relationships make high alcohol consumption less likely; however, since this interval contains 1, the effect is not statistically significant. The same applies to higher-education plans: the odds ratio is 0.70, but the confidence interval 0.24-2.12 includes 1, so there is no statistically significant difference. When an interval includes the number 1, the effect is not statistically significant. Overall, the findings for failures and absences support my earlier hypotheses.
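The rule that an interval containing 1 means no statistically significant difference can be applied to all coefficients at once; a sketch reusing the fitted model m:

```r
# TRUE for coefficients whose 95% interval excludes 1 (i.e. significant)
ci <- exp(confint(m))
ci[, "2.5 %"] > 1 | ci[, "97.5 %"] < 1
```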
We are going to use the variables which, according to my logistic regression model, had a statistical relationship with high/low alcohol consumption, and explore the predictive power of the model.
m <- glm(high_use ~ failures + absences + famrel + higher, data = alc, family = "binomial")
probabilities <- predict(m, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
select(alc, failures, absences, famrel, high_use, probability, prediction) %>% tail(10)
## failures absences famrel high_use probability prediction
## 373 1 0 4 FALSE 0.2765114 FALSE
## 374 1 7 5 TRUE 0.3531843 FALSE
## 375 0 1 5 FALSE 0.1764851 FALSE
## 376 0 6 4 FALSE 0.2898242 FALSE
## 377 1 2 5 FALSE 0.2646186 FALSE
## 378 0 2 4 FALSE 0.2262058 FALSE
## 379 2 2 2 FALSE 0.5234763 TRUE
## 380 0 3 1 FALSE 0.3857482 FALSE
## 381 0 4 2 TRUE 0.3523118 FALSE
## 382 0 2 4 TRUE 0.2262058 FALSE
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.65445026 0.04712042 0.70157068
## TRUE 0.25392670 0.04450262 0.29842932
## Sum 0.90837696 0.09162304 1.00000000
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
g + geom_point()
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.3010471
The training error is 0.30, meaning that about 30% of the observations are incorrectly classified. I do not know how common this is when building a model, but the number seems quite high to me, and the model may not be working as well as it should.
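One way to judge whether a 30% training error is high is to compare it with the trivial strategy of always predicting FALSE (the majority class); its error is simply the share of TRUE cases in the data:

```r
# Training error of always guessing FALSE: the proportion of high users
mean(alc$high_use)   # 114/382, about 0.298
```

Since 0.298 is slightly below the model’s 0.301, the model does not beat majority-class guessing on the training data, which supports the doubt expressed above.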
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
## [1] 0.3115183
After 10-fold cross-validation the error is essentially the same, about 0.31, so roughly 30% of the observations are still incorrectly classified. ***
This week we are learning how to cluster and classify. First we need to load the Boston data from the MASS package.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
#Then we explore the structure and the dimensions of the data
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The data has 506 observations and 14 variables. This dataset is about housing values in the suburbs of Boston. The variable names are abbreviated, so I have explained them below:

- “crim”: per capita crime rate by town
- “zn”: proportion of residential land zoned for lots over 25,000 sq.ft.
- “indus”: proportion of non-retail business acres per town
- “chas”: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- “nox”: nitrogen oxides concentration (parts per 10 million)
- “rm”: average number of rooms per dwelling
- “age”: proportion of owner-occupied units built prior to 1940
- “dis”: weighted mean of distances to five Boston employment centres
- “rad”: index of accessibility to radial highways
- “tax”: full-value property-tax rate per $10,000
- “ptratio”: pupil-teacher ratio by town
- “black”: 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
- “lstat”: lower status of the population (percent)
- “medv”: median value of owner-occupied homes in $1000s
First I load a few packages I may need during the exercise.
library(ggplot2)
library(GGally)
library(tidyverse)
If you look at the variables more closely, you can see that many of them are roughly symmetrically distributed: their median and mean values are more or less close to each other. Variables “zn” and “crim” are clearly not normally distributed, and “chas” is a binary variable.
Next we are going to look at the correlations in the dataset.
#First a picture of the graphical overview
pairs(Boston)
This picture is not very clear. It is a pairs plot, but all the dots are so small and so close to each other that you can see little more than a black mass. You cannot read the correlations from it, so next we are going to look at them more closely.
#Let's make a correlation matrix and draw a correlation plot
cormatrix <- cor(Boston)
cormatrix %>% round(digits = 2)
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
library(corrplot)
## corrplot 0.84 loaded
corrplot(cormatrix, method = "circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
My R could not find the corrplot library at first, even though it was used in DataCamp; after installing the corrplot package I could draw this picture, which is much clearer to me. There is a strong negative correlation (a big red circle) between dis and nox, dis and age, and dis and indus: moving further from the Boston employment centres, the nitrogen oxide concentration goes down and the proportion of owner-occupied units built prior to 1940 goes down. This seems clear and logical. The lower status of the population (lstat) and the median value of owner-occupied homes (medv) also have a strong negative correlation: when the percentage of lower-status population gets bigger, the median home value gets smaller, which is also logical. A positive correlation is marked with a big blue circle; rad and tax have a strong positive correlation, meaning that when the index of accessibility to radial highways rises, the full-value property-tax rate per $10,000 also rises.
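The strongest pairs can also be extracted programmatically instead of being read off the plot; a sketch using an arbitrary |r| > 0.7 cut-off on the cormatrix computed above:

```r
# Variable pairs whose absolute correlation exceeds 0.7
strong <- which(abs(cormatrix) > 0.7 & abs(cormatrix) < 1, arr.ind = TRUE)
strong <- strong[strong[, 1] < strong[, 2], , drop = FALSE]  # keep each pair once
data.frame(var1 = rownames(cormatrix)[strong[, 1]],
           var2 = colnames(cormatrix)[strong[, 2]],
           r    = round(cormatrix[strong], 2))
```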
Standardizing the dataset
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
# boston_scaled is a matrix; we convert it to a data frame for later use
class(boston_scaled)
## [1] "matrix"
boston_scaled <- as.data.frame(boston_scaled)
After scaling you can see from the summary that every variable now has mean 0 (and unit variance), so the variables are on a comparable scale; note that scaling changes location and spread but not the shape of the distributions. This standardization will help with the models fitted later. Next we are going to change the continuous crime rate variable into a categorical variable: we cut the crim variable at its quantiles to put the low, middle and high crime rates into their own categories, then drop the old crim variable from the dataset and replace it with the new crime variable.
# Creating a quantile vector
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# creating a categorical variable crime
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c("low", "med_low", "med_high", "high"))
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
library(dplyr)
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
summary(boston_scaled)
## zn indus chas nox
## Min. :-0.48724 Min. :-1.5563 Min. :-0.2723 Min. :-1.4644
## 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723 1st Qu.:-0.9121
## Median :-0.48724 Median :-0.2109 Median :-0.2723 Median :-0.1441
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723 3rd Qu.: 0.5981
## Max. : 3.80047 Max. : 2.4202 Max. : 3.6648 Max. : 2.7296
## rm age dis rad
## Min. :-3.8764 Min. :-2.3331 Min. :-1.2658 Min. :-0.9819
## 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049 1st Qu.:-0.6373
## Median :-0.1084 Median : 0.3171 Median :-0.2790 Median :-0.5225
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617 3rd Qu.: 1.6596
## Max. : 3.5515 Max. : 1.1164 Max. : 3.9566 Max. : 1.6596
## tax ptratio black lstat
## Min. :-1.3127 Min. :-2.7047 Min. :-3.9033 Min. :-1.5296
## 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049 1st Qu.:-0.7986
## Median :-0.4642 Median : 0.2746 Median : 0.3808 Median :-0.1811
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332 3rd Qu.: 0.6024
## Max. : 1.7964 Max. : 1.6372 Max. : 0.4406 Max. : 3.5453
## medv crime
## Min. :-1.9063 low :127
## 1st Qu.:-0.5989 med_low :126
## Median :-0.1449 med_high:126
## Mean : 0.0000 high :127
## 3rd Qu.: 0.2683
## Max. : 2.9865
dim(boston_scaled)
## [1] 506 14
The data still has 506 observations and 14 variables. The instructions said that we need to divide the dataset into train and test sets, so that 80% of the data goes to the train set.
# We make the train and test sets by randomly sampling 80% of the observations into the train set; the rest go to the test set.
n <- nrow(boston_scaled)
n
## [1] 506
ind <- sample(n, size = n * 0.8)
dim(ind)
## NULL
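A note on the NULL above: sample() returns a plain vector, and vectors have no dim attribute, so dim() gives NULL; length() is the right way to check the size (a small illustration):

```r
# A vector of row indices, like the one used for the train set
ind_demo <- sample(506, size = floor(506 * 0.8))
is.null(dim(ind_demo))  # TRUE: a vector has no dim attribute
length(ind_demo)        # 404 row indices for the train set
```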
train <- boston_scaled[ind, ]
str(train)
## 'data.frame': 404 obs. of 14 variables:
## $ zn : num -0.487 0.585 -0.487 1.871 -0.487 ...
## $ indus : num -0.867 -0.876 -0.437 -1.072 -0.437 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.343 -0.878 -0.144 -0.61 -0.144 ...
## $ rm : num -0.592 0.679 -1.179 0.839 -0.268 ...
## $ age : num -0.791 -0.894 -1.136 -1.438 0.566 ...
## $ dis : num 0.681982 1.98786 0.000692 1.26815 0.31669 ...
## $ rad : num -0.522 -0.178 -0.637 -0.522 -0.637 ...
## $ tax : num -1.093 -0.737 -0.601 -0.227 -0.601 ...
## $ ptratio: num 0.806 0.575 1.175 -0.395 1.175 ...
## $ black : num 0.441 0.426 -0.741 0.343 0.256 ...
## $ lstat : num -0.4 -0.442 -0.135 -1.126 -0.335 ...
## $ medv : num -0.33 0.268 -0.254 0.942 -0.471 ...
## $ crime : Factor w/ 4 levels "low","med_low",..: 1 2 3 1 3 2 1 1 4 2 ...
So now the train set has 404 observations and 14 variables: 80% of the 506 observations we began with.
Next we are creating the test set.
test <- boston_scaled[-ind, ]
str(test)
## 'data.frame': 102 obs. of 14 variables:
## $ zn : num -0.4872 -0.4872 -0.4872 0.0487 -0.4872 ...
## $ indus : num -0.593 -0.593 -1.306 -0.476 -0.437 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.74 -0.74 -0.834 -0.265 -0.144 ...
## $ rm : num 0.194 1.281 0.207 -0.388 -0.478 ...
## $ age : num 0.3668 -0.2655 -0.3508 -0.0702 -0.2407 ...
## $ dis : num 0.557 0.557 1.077 0.838 0.433 ...
## $ rad : num -0.867 -0.867 -0.752 -0.522 -0.637 ...
## $ tax : num -0.986 -0.986 -1.105 -0.577 -0.601 ...
## $ ptratio: num -0.303 -0.303 0.113 -1.504 1.175 ...
## $ black : num 0.441 0.396 0.41 0.426 0.441 ...
## $ lstat : num -0.492 -1.2075 -1.0423 -0.0312 -0.6152 ...
## $ medv : num -0.1014 1.3229 0.6706 0.0399 -0.2319 ...
## $ crime : Factor w/ 4 levels "low","med_low",..: 1 1 1 2 3 3 3 1 2 2 ...
The test set has 102 observations and 14 variables. This is 20% of the original data set.
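Worth noting (my addition): sample() draws a different random split on every run, so the exact rows in the train and test sets change each time the document is knitted. A fixed seed makes the split reproducible; a small sketch with a hypothetical helper function:

```r
# make_split is my own illustrative helper, not from the course material
make_split <- function(n, frac = 0.8, seed = 2017) {
  set.seed(seed)                     # fix the RNG so the split is repeatable
  sample(n, size = floor(n * frac))  # indices of the train rows
}
identical(make_split(506), make_split(506))  # TRUE: same seed, same split
```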
We are using the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. We are going to draw the LDA (bi)plot of the model.
# Fitting the linear discriminant analysis on the train set (lda comes from MASS)
library(MASS)
lda.fit <- lda(formula = crime ~ ., data = train)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[,choices[1]],
         y1 = myscale * heads[,choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
#Target classes as numeric
classes <- as.numeric(train$crime)
#Drawing a plot of the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
#Next we are going to save the crime categories from the test set and then remove the categorical crime variable from the test dataset. After that we are going to predict the classes with the LDA model on the test data.
#We are going to save the crime categories from the test set
correct_classes <- test$crime
library(dplyr)
#Next we are going to remove the categorical crime variable from the test dataset
test <- dplyr::select(test, -crime)
# Next we are going to predict with the LDA model on the test data
lda.pred <- predict(lda.fit, newdata = test)
# Then we were asked to do a cross table of the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 20 10 3 0
## med_low 8 9 4 0
## med_high 2 7 14 1
## high 0 0 1 23
Here you can see how the model performs. It predicts the high crime rates well (23 of 24 correct) but makes more errors when predicting the other crime classes.
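From the cross table above we can also compute the overall accuracy: the diagonal holds the correct predictions (a sketch that re-enters the printed table by hand):

```r
# Confusion matrix copied from the output above (rows = correct, cols = predicted)
conf <- matrix(c(20, 10,  3,  0,
                  8,  9,  4,  0,
                  2,  7, 14,  1,
                  0,  0,  1, 23),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))
accuracy <- sum(diag(conf)) / sum(conf)  # correct predictions / test set size
round(accuracy, 3)  # 0.647: about 65 % of the test set classified correctly
```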
# Next we are going to load the Boston dataset again and standardize it. We will calculate the distances between the observations, run the k-means algorithm, investigate the optimal number of clusters, run the algorithm again, and finally visualize and interpret the clusters.
library(MASS)
data("Boston")
dim(Boston)
## [1] 506 14
We have now loaded the Boston dataset again from the MASS library and checked that it has the correct number of observations (506) and variables (14).
Next we are going to measure the distances. I am going to use the Euclidean distance, which is probably the most common one. K-means is an old and widely used clustering method. It computes the distances automatically, but you have to choose the number of clusters. I first made the model with 3 clusters, because in my opinion it worked better than 4 clusters.
scale_Boston2 <- scale(Boston)
scale_Boston2 <- as.data.frame(scale_Boston2)
# Next we are going to calculate the Euclidean distances between the observations
dist_eu <- dist(scale_Boston2)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
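The Euclidean distance that dist() computes for each pair of observations is just the square root of the summed squared differences; a toy check:

```r
# Two toy observations with three features each
a <- c(1, 2, 3)
b <- c(4, 6, 3)
manual <- sqrt(sum((a - b)^2))            # sqrt(3^2 + 4^2 + 0^2) = 5
d <- dist(rbind(a, b), method = "euclidean")
all.equal(manual, as.numeric(d))          # TRUE
```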
# K-means clustering
km <- kmeans(scale_Boston2, centers = 3)
#Plotting the Boston dataset with clusters
pairs(scale_Boston2, col = km$cluster)
pairs(scale_Boston2[1:6], col = km$cluster)
pairs(scale_Boston2[7:13], col = km$cluster)
Next we are going to investigate the optimal number of clusters. There are many ways to find it, but we will use the total within-cluster sum of squares (TWCSS) and visualise the result with a plot.
# First determine the number of clusters
k_max <- 10
# Calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(scale_Boston2, k)$tot.withinss})
# Next we are going to visualize the results (qplot comes from ggplot2)
library(ggplot2)
qplot(x = 1:k_max, y = twcss, geom = "line")
The optimal number of clusters is where the total WCSS drops radically. As you can see from the plot above, this happens around x = 2, so the optimal number of clusters would be 2. Next we run the algorithm again with two clusters.
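A caveat the course material doesn't require but is worth knowing: kmeans() starts from random centers, so both the clustering and the TWCSS curve can change between runs. A fixed seed plus the nstart argument, which keeps the best of several random starts, gives more stable results (a sketch on the scaled Boston data):

```r
# Stabilize k-means with a seed and multiple random starts
library(MASS)                       # for the Boston data
boston_std <- as.data.frame(scale(Boston))
set.seed(123)                       # reproducible starting centers
km_stable <- kmeans(boston_std, centers = 2, nstart = 25)  # best of 25 starts
table(km_stable$cluster)            # sizes of the two clusters
```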
km <- kmeans(scale_Boston2, centers = 2)
pairs(scale_Boston2, col = km$cluster)
In this first plot all the variables are included. Because there are so many variables, I think the plot is quite difficult to interpret and understand. That is why I chose to draw two more plots, so that I can look at the effects more closely.
pairs(scale_Boston2[1:8], col = km$cluster)
pairs(scale_Boston2[6:13], col = km$cluster)
As you can see from the plots above, the variable “chas” doesn’t follow any pattern with any of the other variables. These kinds of plots are hard for me to read because I am doing this for the very first time. I think, however, that there might be negative correlations between indus-dis, nox-dis and dis-lstat, and positive correlations between indus-nox, age-nox and age-lstat.
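The correlations guessed from the scatterplot matrices can be checked numerically with cor(); scaling does not change correlations, so the raw Boston data gives the same answer (a quick sketch, my own addition):

```r
library(MASS)                       # for the Boston data
pick <- c("indus", "nox", "dis", "age", "lstat")
round(cor(Boston[, pick]), 2)
# indus-dis, nox-dis and dis-lstat come out clearly negative;
# indus-nox, age-nox and age-lstat clearly positive
```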
Dimensionality reduction techniques. I had huge problems with the data wrangling, and after all the tricks I tried I couldn’t get the observations and variables right before starting the analysis part. So I had to read the pre-processed data that was handed out to us.
human <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", sep=",", header=TRUE)
Let's check that we now have the right number of observations and variables. After the data wrangling the human data should have 155 observations and 8 variables. As you can see, we now have the correct data in hand.
dim(human)
## [1] 155 8
This week we are going to learn dimensionality reduction techniques. We will work on a dataset called the human data. It originates from the United Nations Development Programme and is about measuring the development of a country with the Human Development Index (HDI). More information about the data can be found on these two webpages: http://hdr.undp.org/en/content/human-development-index-hdi and http://hdr.undp.org/sites/default/files/hdr2015_technical_notes.pdf.
As said before, my data wrangling did not work and I ended up with the wrong number of observations and variables, so I did the analysis with the given data.
We have read the data from the webpage and checked that we now have the correct number of observations and variables. Let’s look at the variable names.
colnames(human)
## [1] "Edu2.FM" "Labo.FM" "Edu.Exp" "Life.Exp" "GNI" "Mat.Mor"
## [7] "Ado.Birth" "Parli.F"
Next we are going to look at the summary of the data.
summary(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
You can see now the means, the medians and the quartiles of the variables.
head(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp GNI Mat.Mor Ado.Birth
## Norway 1.0072389 0.8908297 17.5 81.6 64992 4 7.8
## Australia 0.9968288 0.8189415 20.2 82.4 42261 6 12.1
## Switzerland 0.9834369 0.8251001 15.8 83.0 56431 6 1.9
## Denmark 0.9886128 0.8840361 18.7 80.2 44025 5 5.1
## Netherlands 0.9690608 0.8286119 17.9 81.6 45435 6 6.2
## Germany 0.9927835 0.8072289 16.5 80.9 43919 7 3.8
## Parli.F
## Norway 39.6
## Australia 30.5
## Switzerland 28.5
## Denmark 38.0
## Netherlands 36.9
## Germany 36.9
The data consists of 155 countries and 8 variables for each country. You can see the name of the country and then the variables giving indications of the country’s development.
Empowerment:

- Edu2.FM = proportion of females with at least secondary education / proportion of males with at least secondary education
- Labo.FM = proportion of females in the labour force / proportion of males in the labour force
- Parli.F = percentage of female representatives in parliament

Health and knowledge:

- Life.Exp = life expectancy at birth
- Edu.Exp = expected years of schooling
- GNI = gross national income per capita
- Mat.Mor = maternal mortality ratio
- Ado.Birth = adolescent birth rate
#Let's look at the structure of the data
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
library(tidyverse)
library(dplyr)
library(ggplot2)
library(GGally)
#Let's look at a graphical overview of the data
gather(human) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
h <- ggpairs(human, mapping = aes(alpha = 0.5), lower = list(combo = wrap("facethist", bins = 20)))
h
Here you can see the graphical overviews of the data. All the variables are numeric. Only one variable, Edu.Exp, is normally distributed.
## Next we are going to study the relationships between the variables with a correlation matrix
#We are going to do a correlation matrix and draw a correlation plot to look at the relationships between the variables
library(corrplot)
library(tidyverse)
correlation <- cor(human)
correlation %>% round(digits = 2)
## Edu2.FM Labo.FM Edu.Exp Life.Exp GNI Mat.Mor Ado.Birth Parli.F
## Edu2.FM 1.00 0.01 0.59 0.58 0.43 -0.66 -0.53 0.08
## Labo.FM 0.01 1.00 0.05 -0.14 -0.02 0.24 0.12 0.25
## Edu.Exp 0.59 0.05 1.00 0.79 0.62 -0.74 -0.70 0.21
## Life.Exp 0.58 -0.14 0.79 1.00 0.63 -0.86 -0.73 0.17
## GNI 0.43 -0.02 0.62 0.63 1.00 -0.50 -0.56 0.09
## Mat.Mor -0.66 0.24 -0.74 -0.86 -0.50 1.00 0.76 -0.09
## Ado.Birth -0.53 0.12 -0.70 -0.73 -0.56 0.76 1.00 -0.07
## Parli.F 0.08 0.25 0.21 0.17 0.09 -0.09 -0.07 1.00
Then the correlation plot
corrplot.mixed(correlation, lower.col = "black", number.cex = .6)
Here you can see that the percentage of female representatives in parliament (variable Parli.F) and the ratio of females to males in the labour force (variable Labo.FM) don’t have strong correlations with any of the other variables.
The maternal mortality ratio (variable Mat.Mor) and life expectancy (variable Life.Exp) have a strong negative correlation: when the maternal mortality ratio gets higher, life expectancy gets lower. The adolescent birth rate (variable Ado.Birth) also has a strong negative correlation with life expectancy (variable Life.Exp). Expected years of schooling (variable Edu.Exp) and GNI seem to have a positive relationship with life expectancy (variable Life.Exp).
# Principal Component Analysis
The next step was to do a PCA (principal component analysis) on the non-standardized data. Principal Component Analysis can be performed by two slightly different matrix decomposition methods from linear algebra: the Eigenvalue Decomposition and the Singular Value Decomposition (SVD). The prcomp() function uses the SVD and is the preferred, more numerically accurate method.
#First let's make a PCA with the SVD-method
pca_human <- prcomp(human)
sum_pca_human <- summary(pca_human)
sum_pca_human
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000
## PC8
## Standard deviation 0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
This is the PCA with the SVD-method.
pca_pr <- round(100*sum_pca_human$importance[2, ], digits = 1)
pc_lab <- paste0(names(pca_pr), "(", pca_pr, "%)")
#Drawing the biplot
biplot(pca_human, choices = 1:2, cex = c(0.8,1.0), col=c("coral", "black"), xlab = pc_lab[1], ylab = pc_lab[2], main = "PCA plot of non-scaled human data")
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
As you can see, almost all the variables are gathered in one corner and we have only one arrow. The other arrows have zero length and an indeterminate angle, so they are skipped. This happens because PCA is sensitive to the relative scaling of the original features: it assumes that features with larger variance are more important than features with smaller variance. GNI has by far the largest variance, so it dominates the plot.
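The effect can be reproduced on toy data: when one column's variance dwarfs the others, unscaled PCA loads almost everything onto it (a hedged illustration of my own, not the course data):

```r
set.seed(1)
small <- rnorm(100, sd = 1)      # variance ~ 1
huge  <- rnorm(100, sd = 1000)   # variance ~ 1e6, like GNI vs the ratio variables
pca   <- prcomp(data.frame(small, huge))               # no scaling
summary(pca)$importance["Proportion of Variance", 1]   # ~1: PC1 is all 'huge'
pca_s <- prcomp(data.frame(small, huge), scale. = TRUE)
summary(pca_s)$importance["Proportion of Variance", 1] # ~0.5 after scaling
```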
#Next we are going to standardize the variables in the human data and repeat the above analysis
human_scaled <- scale(human)
str(human_scaled)
## num [1:155, 1:8] 0.639 0.596 0.54 0.562 0.481 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:155] "Norway" "Australia" "Switzerland" "Denmark" ...
## ..$ : chr [1:8] "Edu2.FM" "Labo.FM" "Edu.Exp" "Life.Exp" ...
## - attr(*, "scaled:center")= Named num [1:8] 8.53e-01 7.07e-01 1.32e+01 7.17e+01 1.76e+04 ...
## ..- attr(*, "names")= chr [1:8] "Edu2.FM" "Labo.FM" "Edu.Exp" "Life.Exp" ...
## - attr(*, "scaled:scale")= Named num [1:8] 2.42e-01 1.99e-01 2.84 8.33 1.85e+04 ...
## ..- attr(*, "names")= chr [1:8] "Edu2.FM" "Labo.FM" "Edu.Exp" "Life.Exp" ...
Let’s look at the summary of the scaled data
summary(human_scaled)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7378 Min. :-2.7188
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6782 1st Qu.:-0.6425
## Median : 0.3503 Median : 0.2316 Median : 0.1140 Median : 0.3056
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.7126 3rd Qu.: 0.6717
## Max. : 2.6646 Max. : 1.6632 Max. : 2.4730 Max. : 1.4218
## GNI Mat.Mor Ado.Birth Parli.F
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
As you can see the means and the medians are very close to each other.
class(human_scaled)
## [1] "matrix"
human_scaled <- as.data.frame(human_scaled)
#Let's make the PCA again with using the SVD-method
pca_human_s <- prcomp(human_scaled)
sum_pca_human_s<-summary(pca_human_s)
pca_pr_s <- round(100*sum_pca_human_s$importance[2, ], digits = 1)
pc_lab<-paste0(names(pca_pr_s), " (", pca_pr_s, "%)")
sum_pca_human_var_s<-sum_pca_human_s$sdev^2
sum_pca_human_var_s
## [1] 4.2883701 1.2989625 0.7657100 0.6066276 0.4381862 0.2876242 0.2106805
## [8] 0.1038390
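For interpretation, the component variances printed above can be turned into percentages of the total. With 8 standardized variables the total variance is 8, so PC1's 4.29 corresponds to roughly 54 % (a small check using the numbers printed above):

```r
# Component variances (sdev^2) copied from the output above
vars_pc <- c(4.2884, 1.2990, 0.7657, 0.6066, 0.4382, 0.2876, 0.2107, 0.1038)
round(100 * vars_pc / sum(vars_pc), 1)
# PC1 explains ~53.6 % and PC2 ~16.2 % of the total variance
```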
Next let’s draw the biplot
biplot(pca_human_s, choices = 1:2, cex= c(0.5,1.0), col=c("coral", "black"), xlab = pc_lab[1], ylab = pc_lab[2], main = "PCA plot of scaled human data")
As you can see, now we have more than one arrow, so the standardization works. The scales of the variables are now comparable and GNI, which had the largest variance, doesn’t “run over” the other variables.
Edu.Exp, GNI, Edu2.FM and Life.Exp are close together and their arrows form a small angle, which means these variables have a high positive correlation. The arrows of Mat.Mor and Ado.Birth point in the opposite direction, which means they have a strong negative correlation with the variables mentioned earlier. All these arrows form a wide angle with Labo.FM and Parli.F, which means the correlation with those two is low.
The angle between a feature and a PC axis can be understood as the correlation between the two: a small angle means a high positive correlation.
The lengths of the arrows are proportional to the standard deviations of the variables, and they seem to be fairly similar across the variables.
# Next we are going to do multiple correspondence analysis (MCA)
library(FactoMineR)
You need to install the FactoMineR package to do the MCA. From this package we will use the “tea” data for the analysis. After loading the package, let’s see what the tea data looks like.
data(tea)
colnames(tea)
## [1] "breakfast" "tea.time" "evening"
## [4] "lunch" "dinner" "always"
## [7] "home" "work" "tearoom"
## [10] "friends" "resto" "pub"
## [13] "Tea" "How" "sugar"
## [16] "how" "where" "price"
## [19] "age" "sex" "SPC"
## [22] "Sport" "age_Q" "frequency"
## [25] "escape.exoticism" "spirituality" "healthy"
## [28] "diuretic" "friendliness" "iron.absorption"
## [31] "feminine" "sophisticated" "slimming"
## [34] "exciting" "relaxing" "effect.on.health"
Here you can see what kind of variables you have in the tea-data.
Let’s look at the summary next
summary(tea)
## breakfast tea.time evening lunch
## breakfast :144 Not.tea time:131 evening :103 lunch : 44
## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
##
##
##
##
##
## dinner always home work
## dinner : 21 always :103 home :291 Not.work:213
## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
##
##
##
##
##
## tearoom friends resto pub
## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
## tearoom : 58 Not.friends:104 resto : 79 pub : 63
##
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
##
## where price age sex
## chain store :192 p_branded : 95 Min. :15.00 F:178
## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
## tea shop : 30 p_private label: 21 Median :32.00
## p_unknown : 12 Mean :37.05
## p_upscale : 53 3rd Qu.:48.00
## p_variable :112 Max. :90.00
##
## SPC Sport age_Q frequency
## employee :59 Not.sportsman:121 15-24:92 1/day : 95
## middle :40 sportsman :179 25-34:69 1 to 2/week: 44
## non-worker :64 35-44:40 +2/day :127
## other worker:20 45-59:61 3 to 6/week: 34
## senior :35 +60 :38
## student :70
## workman :12
## escape.exoticism spirituality healthy
## escape-exoticism :142 Not.spirituality:206 healthy :210
## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
##
##
##
##
##
## diuretic friendliness iron.absorption
## diuretic :174 friendliness :242 iron absorption : 31
## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
##
##
##
##
##
## feminine sophisticated slimming
## feminine :129 Not.sophisticated: 85 No.slimming:255
## Not.feminine:171 sophisticated :215 slimming : 45
##
##
##
##
##
## exciting relaxing effect.on.health
## exciting :116 No.relaxing:113 effect on health : 66
## No.exciting:184 relaxing :187 No.effect on health:234
##
##
##
##
##
Let’s look at the structure of the data
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
As you can see, the tea data has 300 observations and 36 variables. Almost all the variables are categorical, with 2 to 7 categories. Only the variable “age” is not categorical.
Next we want to look at a graphical overview of the data. With MCA you can analyse the pattern of relationships between several categorical variables. MCA deals with categorical variables, but continuous ones can be used as background variables.
For the categorical variables, you can use either the indicator matrix or the Burt matrix in the analysis. The indicator matrix contains all the levels of the categorical variables as binary variables (1 = belongs to the category, 0 = doesn’t belong to the category). The Burt matrix is a matrix of two-way cross-tabulations between all the variables in the dataset.
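As a tiny illustration of the indicator-matrix idea (my own sketch, not part of the exercise), model.matrix() can build the binary columns for one factor:

```r
# One toy categorical variable with three levels
f <- factor(c("black", "green", "black", "lemon"))
ind_mat <- model.matrix(~ f - 1)   # '-1': one binary column per level, no intercept
colnames(ind_mat) <- levels(f)
rowSums(ind_mat)                   # each row sums to 1: one level per observation
```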
The instructions said that you can do the MCA for all the columns or just for some of them. Because the tea data has so many observations and variables, I chose to use only some of them in the MCA; I think that makes it easier for me to interpret. We are also going to make a summary and a graphical overview.
#Let's choose the columns we want to keep
library(dplyr)
library(tidyverse)
library(FactoMineR)
keep <- c("Tea", "How", "sugar", "sex", "age_Q", "breakfast", "work", "price")
#Next we are creating a new dataset with the selected columns
teal <- tea[, keep]
library(GGally)
library(ggplot2)
#Next we are going to do the summary and the graphical overview of the new data teal
summary(teal)
## Tea How sugar sex age_Q
## black : 74 alone:195 No.sugar:155 F:178 15-24:92
## Earl Grey:193 lemon: 33 sugar :145 M:122 25-34:69
## green : 33 milk : 63 35-44:40
## other: 9 45-59:61
## +60 :38
##
## breakfast work price
## breakfast :144 Not.work:213 p_branded : 95
## Not.breakfast:156 work : 87 p_cheap : 7
## p_private label: 21
## p_unknown : 12
## p_upscale : 53
## p_variable :112
gather(teal) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust =1, size =8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
Next let’s do the multiple correspondence analysis.
mca <- MCA(teal, graph = FALSE)
Summary of the model
summary(mca)
##
## Call:
## MCA(X = teal, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.219 0.185 0.182 0.162 0.154 0.138
## % of var. 9.737 8.231 8.068 7.191 6.831 6.137
## Cumulative % of var. 9.737 17.968 26.036 33.228 40.058 46.195
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11 Dim.12
## Variance 0.135 0.129 0.127 0.118 0.109 0.109
## % of var. 6.009 5.712 5.653 5.231 4.866 4.835
## Cumulative % of var. 52.203 57.916 63.568 68.799 73.666 78.501
## Dim.13 Dim.14 Dim.15 Dim.16 Dim.17 Dim.18
## Variance 0.098 0.092 0.089 0.071 0.068 0.065
## % of var. 4.366 4.102 3.947 3.173 3.035 2.876
## Cumulative % of var. 82.867 86.969 90.916 94.089 97.124 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | 0.135 0.028 0.004 | 0.235 0.099 0.012 | 0.882 1.428
## 2 | 0.561 0.479 0.162 | 0.751 1.014 0.290 | 0.022 0.001
## 3 | -0.086 0.011 0.005 | 0.591 0.629 0.239 | -0.418 0.321
## 4 | -0.462 0.325 0.192 | -0.469 0.395 0.198 | -0.087 0.014
## 5 | 0.122 0.023 0.011 | 0.262 0.124 0.052 | -0.029 0.002
## 6 | -0.291 0.129 0.033 | -0.279 0.140 0.031 | -0.188 0.065
## 7 | 0.061 0.006 0.002 | 0.309 0.172 0.058 | 0.143 0.038
## 8 | 0.507 0.392 0.115 | 0.556 0.556 0.138 | 0.057 0.006
## 9 | 0.408 0.254 0.069 | 0.340 0.208 0.048 | 0.615 0.695
## 10 | 0.808 0.993 0.280 | 0.220 0.087 0.021 | 0.407 0.304
## cos2
## 1 0.163 |
## 2 0.000 |
## 3 0.120 |
## 4 0.007 |
## 5 0.001 |
## 6 0.014 |
## 7 0.012 |
## 8 0.001 |
## 9 0.156 |
## 10 0.071 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 1.097 16.950 0.394 10.859 | 0.336 1.881 0.037
## Earl Grey | -0.532 10.397 0.511 -12.360 | 0.056 0.135 0.006
## green | 0.652 2.666 0.052 3.962 | -1.080 8.663 0.144
## alone | -0.081 0.243 0.012 -1.908 | -0.127 0.712 0.030
## lemon | -0.121 0.092 0.002 -0.736 | -0.564 2.366 0.039
## milk | 0.053 0.034 0.001 0.471 | 0.566 4.545 0.085
## other | 1.829 5.723 0.103 5.561 | 0.866 1.519 0.023
## No.sugar | 0.462 6.300 0.228 8.265 | 0.403 5.664 0.174
## sugar | -0.494 6.734 0.228 -8.265 | -0.431 6.055 0.174
## F | -0.049 0.080 0.003 -1.015 | 0.288 3.323 0.121
## v.test Dim.3 ctr cos2 v.test
## black 3.326 | 0.337 1.930 0.037 3.335 |
## Earl Grey 1.296 | -0.085 0.317 0.013 -1.965 |
## green -6.566 | -0.261 0.516 0.008 -1.587 |
## alone -3.002 | -0.349 5.462 0.227 -8.232 |
## lemon -3.431 | 0.555 2.333 0.038 3.374 |
## milk 5.048 | 0.781 8.825 0.162 6.965 |
## other 2.634 | 0.066 0.009 0.000 0.200 |
## No.sugar 7.205 | -0.319 3.620 0.109 -5.703 |
## sugar -7.205 | 0.341 3.870 0.109 5.703 |
## F 6.016 | -0.559 12.787 0.457 -11.685 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.526 0.158 0.040 |
## How | 0.107 0.135 0.241 |
## sugar | 0.228 0.174 0.109 |
## sex | 0.003 0.121 0.457 |
## age_Q | 0.550 0.332 0.313 |
## breakfast | 0.000 0.173 0.055 |
## work | 0.097 0.325 0.056 |
## price | 0.241 0.063 0.181 |
Visualizing the MCA
plot(mca, invisible = c("ind"), habillage = "quali")
My interpretation of the MCA above: the distance between the different points gives a measure of their similarity. Younger people (ages 15 to 24 and 25 to 34) like to have tea with sugar and at times other than breakfast, while middle-aged people (ages 35 to 44 and 45 to 59) prefer to drink their tea without sugar.
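One caution I would add (my own note): per the eigenvalue table in the summary above, Dim.1 and Dim.2 together explain only about 18 % of the variance, so the map should be read with care. This can be checked by hand: with the 8 kept variables there are 26 category levels in total, and for indicator-matrix MCA the total inertia is (26 − 8) / 8 = 2.25, so the percentages follow directly from the printed variances:

```r
# Variances of Dim.1 and Dim.2 copied from the MCA summary above
eig1  <- c(0.219, 0.185)
total <- (26 - 8) / 8                  # total inertia: (levels - variables) / variables
round(100 * cumsum(eig1) / total, 1)   # ~9.7 and ~18.0 % cumulative, as printed
```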